Fall R ’23
Dominic Bordelon, Research Data Librarian
University Library System, University of Pittsburgh
dbordelon@pitt.edu
Services for the Pitt community:
Support areas and interests:
| # | Date | Title |
|---|---|---|
| 1 | 8/29 | Getting Started with Tabular Data |
| 2 | 9/5 | Working with Data Frames |
| 3 | 9/12 | Data Visualization |
| 4 | 9/19 | Inference and Modeling Intro |
| 5 | 9/26 | Machine Learning Intro |
tidymodels, particularly parsnip: standardized modeling interface
💡 In terms of writing code, there are a variety of approaches to modeling in R, even for fitting the same type of model (e.g., when implemented by different package developers). We will favor the tidymodels approach.
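As a minimal sketch of what that standardized interface looks like (assuming the parsnip and palmerpenguins packages are installed; the object names here are our own):

```r
library(parsnip)
library(palmerpenguins)

# In parsnip, the model specification is separate from the engine
# (implementation) that fits it:
spec <- linear_reg() %>%   # model type
  set_engine("lm")         # engine; "glmnet", "stan", etc. are also possible

mass_fit <- fit(spec, body_mass_g ~ flipper_length_mm,
                data = na.omit(penguins))
```

Swapping `set_engine("lm")` for another engine changes the implementation without changing how the rest of the code is written.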
ML often describes statistical concepts with different language, due to separate disciplinary traditions.
| Statistics term | ML / Computer Science term |
|---|---|
| observation, case | example, instance |
| response variable, dependent variable | label, output |
| predictor, independent variable | feature, input |
| regression | regression, supervised learner, machine |
| estimation | learning |
| outlier | anomaly |
⚠ Terms/concepts to be careful with in ML, coming from stats:
- Regression
- Classification
Supervised learning has a predictive output or target (regression or classification of a variable). A model is fit which predicts (or retrodicts) some \(y\) from one or more \(x\).
Animation of the least-squares method of fitting a linear regression. Data points are red, and their residuals (distance to the regression) are dotted gray lines. The mathematical goal in least-squares fitting is to minimize the sum of squared residuals. Image source: Stephen1729 via Wikimedia Commons (CC BY-SA 3.0)
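The least-squares objective described in the caption can be checked directly in base R: `lm()` chooses the line whose residual sum of squares (RSS) is smallest. The toy data below are our own illustration.

```r
# Simulate points around a known line, then fit by least squares
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)          # true line plus noise

fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)        # the quantity least squares minimizes

# Any other candidate line has at least as large an RSS:
rss_other <- sum((y - (1 + 3.5 * x))^2)
rss <= rss_other                    # TRUE
```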
Pt is an observation of unknown category, which we would like to classify. The K nearest neighbors (K = 7 here) are found using a distance function, and the majority class of those neighbors is assigned to Pt. The result would be Class ii in this case. Image source: Atallah, Badawy, and El-Sayed 2019
Choice of K has a great effect on the decision boundary (black line). K = 1 will overfit, but K = 100 is far too generalized in this case. The dashed purple line compares a Bayesian classifier fit. Image source: James et al. 2021
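A small sketch of the KNN procedure from the figures, using `class::knn()` (assumes the class package is installed; the two simulated clusters are our own toy data):

```r
library(class)

# Two clusters of training points in 2D space
set.seed(42)
train_x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                 matrix(rnorm(40, mean = 3), ncol = 2))
train_y <- factor(rep(c("i", "ii"), each = 20))

# Classify a new point by majority vote among its K = 7 nearest neighbors
new_pt <- matrix(c(2.8, 3.1), ncol = 2)
pred <- knn(train = train_x, test = new_pt, cl = train_y, k = 7)
pred
```

With this seed, the new point sits near the second cluster, so the majority of its neighbors belong to class "ii".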
Animation of a simple decision tree example. Each binary branch in the tree on the left corresponds to a partitioning in the x-y space. The response variable (output) of this model is gray/green color classification. Image source: Algobeans
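A comparable partitioning can be fit in R with `rpart` (assumed installed), here classifying iris species from two petal measurements:

```r
library(rpart)

data(iris)
# Each binary split in the printed tree corresponds to a rectangular
# partition of the Petal.Length x Petal.Width space:
tree <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)
tree

pred <- predict(tree, data.frame(Petal.Length = 1.4, Petal.Width = 0.2),
                type = "class")
pred   # predicts "setosa"
```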
Image source: James et al. 2021
Classifier which fits a hyperplane. The hyperplane forms the categorical or decision boundary; observations in the boundary region are support vectors, which push against the hyperplane across a distance called the margin.
The margin between support vectors and hyperplane may need to be a soft margin because of overlapping class regions (i.e., observations of different classes are mixed).
Support vector classifier with four different values for the tuning parameter C. As C gets smaller, the tolerance for observations on the “wrong” side decreases, and margins therefore decrease accordingly. Image source: James et al. 2021
Support vector machines extend the support vector classifier to use non-linear kernels. The data in this figure would fit a poor linear model using the linear classifier. Image source: James et al. 2021
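A sketch of the soft-margin tuning with `e1071::svm()` (assumed installed; simulated data are our own). Note that e1071's `cost` argument runs opposite to the C in James et al.: a larger `cost` penalizes violations more, giving a narrower margin.

```r
library(e1071)

# Two overlapping classes in 2D
set.seed(7)
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 2), ncol = 2))
dat <- data.frame(x1 = x[, 1], x2 = x[, 2],
                  y = factor(rep(c("A", "B"), each = 20)))

fit_soft <- svm(y ~ ., data = dat, kernel = "linear", cost = 0.1)
fit_hard <- svm(y ~ ., data = dat, kernel = "linear", cost = 100)

# A softer (wider) margin touches more observations, so it typically
# has more support vectors:
c(soft = fit_soft$tot.nSV, hard = fit_hard$tot.nSV)
```

Replacing `kernel = "linear"` with `"radial"` or `"polynomial"` gives the non-linear support vector machine described above.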
Animation of the naive Bayes classifier. Color intensity indicates probability of group membership. Image source: Jacopo Bertolotti via Wikimedia Commons (CC0)
library(palmerpenguins)  # penguins dataset
library(dplyr)
library(forcats)         # fct_drop()
library(e1071)           # naiveBayes()

# Keep two species and drop rows with missing predictor values
penguins_clean <- penguins %>%
  filter(species %in% c("Gentoo", "Adelie"),
         !is.na(body_mass_g),
         !is.na(flipper_length_mm)) %>%
  mutate(species = fct_drop(species))   # drop the now-unused Chinstrap level

species_fit <- naiveBayes(species ~ body_mass_g + flipper_length_mm,
                          data = penguins_clean)
species_fit

predict(species_fit, data.frame(body_mass_g = 3500, flipper_length_mm = 220))
Image source: James et al. 2021
Animation demonstrating a neural network for handwriting identification. Note that only certain nodes activate. The class (i.e., Arabic numeral) predicted from the input is “2”. Image source: Suraj Yadav
Unsupervised learning has no predictive model: instead, it finds previously unknown structure in the data. All variables or features of the data are considered together.
Unsupervised learning tends to be most useful for exploratory data analysis, i.e., prior to having a goal for regression or classification.
Animation demonstrating projection of two features onto a single histogram using principal components analysis. Image source: Amélia O. F. da S. via Wikimedia Commons (CC BY-SA 4.0)
Image source: James et al. 2021
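The same kind of projection can be computed in base R with `prcomp()`; here on the four iris measurements as an illustration:

```r
data(iris)

# Principal components analysis; scale. = TRUE standardizes each feature
pca <- prcomp(iris[, 1:4], scale. = TRUE)

summary(pca)          # proportion of variance explained per component
head(pca$x[, 1:2])    # scores on the first two principal components
```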
Animation of the K-means algorithm in action. Centroids are first placed at random, and each observation is assigned to its nearest centroid; centroids and assignments are then updated iteratively until they stop changing. Image source: Chire on Wikimedia Commons (CC BY-SA 4.0)
150 observations in 2D space, clustered according to different values of K. Prior to clustering, data are not categorized. Colors indicate which group each observation is assigned to by the model. Image source: James et al. 2021
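The iterative procedure from the animation is what base R's `kmeans()` runs; a minimal sketch on two simulated clusters (data our own):

```r
# Two well-separated clusters of 2D points
set.seed(123)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# nstart = 20 repeats the random initialization and keeps the best result
km <- kmeans(x, centers = 2, nstart = 20)

table(km$cluster)     # how many observations fell in each cluster
km$centers            # final centroid coordinates
```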
Check out the Big Book of R! An online directory at https://www.bigbookofr.com/ of many R ebooks, most of them free OER produced by experts, organized by discipline/topic and searchable.
Look up your discipline (or some topic that interests you, e.g., time series data) and see what applications of R you can find.
Example graphic of a recent update